Introduction

Overview and Motivation

Our inspiration comes from the Kaggle competition Instacart Market Basket Analysis, which is also the source of our data sets. Instacart is a grocery ordering and delivery application. They provide an anonymized dataset that contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, they provide between 4 and 100 of their orders, with the sequence of products purchased in each order, the week and hour of day the order was placed, and a relative measure of time between orders (details of each data set are introduced below).

Instacart hoped that competition participants would test models for predicting which products a user will buy again, try for the first time, or add to their cart next during a session, possibly using models such as XGBoost, word2vec and Annoy (Jeremy Stanley, May 3, 2017).

Predicting repurchases and order placement days are popular and helpful tasks for e-commerce companies. For example, such predictions can be applied to personalization, supply and demand management, churn prediction, improved customer service, etc. (Bigcommerce Blog, Nick Shaw). Amazon has already developed a patent called “anticipatory shipping” that can predict what and when people want to buy and ship packages even before customers have placed an order (The Economic Times, Jan 27, 2014). In this way, they can largely optimize logistics management, human and equipment resources, and inventory arrangement, which helps them decrease cost and increase profit. Meanwhile, this type of prediction also requires much more information about customers’ behavior, such as the items customers have searched for, the amount of time a user’s cursor hovers over a product, the number of clicks, and the purchase conversion rates of clicks, add-to-cart events, collections and so on.

Given these data limitations, and since we would like to apply the models we have learnt in the course, we choose to predict the day of the week on which an order will be placed. This prediction can then serve as an additional input for demand forecasting, which is useful for steering decision-making processes, such as inventory arrangement, for an e-commerce platform.

Research questions

Overall, we produce a new dataset based on the data we downloaded from the competition website, and make the following assumptions:

  1. one order = one user (due to the data limitation mentioned above);
  2. we already know what customers will buy next, i.e. the demand is known.

Thus, our research questions will be:

  • On what day of the week will a given order be placed?
    For this question, we will use supervised methods.

  • Are there any common components between departments or aisles?
    For this question, we will use unsupervised methods.

Exploratory Data Analysis

Data Description

Table 1 - aisles

The aisles table shows each aisle’s unique id in the aisle_id column and its name in the aisle column; for example, aisle_id 1 represents the prepared soups salads aisle. There are 134 ids in total and there are no missing values in the table.

The aisles table
aisle_id aisle
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
11 cold flu allergy
12 fresh pasta
13 prepared meals
14 tofu meat alternatives
15 packaged seafood
16 fresh herbs
17 baking ingredients
18 bulk dried fruits vegetables
19 oils vinegars
20 oral hygiene
21 packaged cheese
22 hair care
23 popcorn jerky
24 fresh fruits
25 soap
26 coffee
27 beers coolers
28 red wines
29 honeys syrups nectars
30 latino foods
31 refrigerated
32 packaged produce
33 kosher foods
34 frozen meat seafood
35 poultry counter
36 butter
37 ice cream ice
38 frozen meals
39 seafood counter
40 dog food care
41 cat food care
42 frozen vegan vegetarian
43 buns rolls
44 eye ear care
45 candy chocolate
46 mint gum
47 vitamins supplements
48 breakfast bars pastries
49 packaged poultry
50 fruit vegetable snacks
51 preserved dips spreads
52 frozen breakfast
53 cream
54 paper goods
55 shave needs
56 diapers wipes
57 granola
58 frozen breads doughs
59 canned meals beans
60 trash bags liners
61 cookies cakes
62 white wines
63 grains rice dried goods
64 energy sports drinks
65 protein meal replacements
66 asian foods
67 fresh dips tapenades
68 bulk grains rice dried goods
69 soup broth bouillon
70 digestion
71 refrigerated pudding desserts
72 condiments
73 facial care
74 dish detergents
75 laundry
76 indian foods
77 soft drinks
78 crackers
79 frozen pizza
80 deodorants
81 canned jarred vegetables
82 baby accessories
83 fresh vegetables
84 milk
85 food storage
86 eggs
87 more household
88 spreads
89 salad dressing toppings
90 cocoa drink mixes
91 soy lactosefree
92 baby food formula
93 breakfast bakery
94 tea
95 canned meat seafood
96 lunch meat
97 baking supplies decor
98 juice nectars
99 canned fruit applesauce
100 missing
101 air fresheners candles
102 baby bath body care
103 ice cream toppings
104 spices seasonings
105 doughs gelatins bake mixes
106 hot dogs bacon sausage
107 chips pretzels
108 other creams cheeses
109 skin care
110 pickled goods olives
111 plates bowls cups flatware
112 bread
113 frozen juice
114 cleaning products
115 water seltzer sparkling water
116 frozen produce
117 nuts seeds dried fruit
118 first aid
119 frozen dessert
120 yogurt
121 cereal
122 meat counter
123 packaged vegetables fruits
124 spirits
125 trail mix snack mix
126 feminine care
127 body lotions soap
128 tortillas flat bread
129 frozen appetizers sides
130 hot cereal pancake mixes
131 dry pasta
132 beauty
133 muscles joints pain relief
134 specialty wines champagnes

Table 2 - departments

The departments table shows each department’s unique id in the department_id column and its name in the department column; for example, department_id 1 represents the frozen department. There are 21 ids in total and there are no missing values in the table.

The departments table
department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol
6 international
7 beverages
8 pets
9 dry goods pasta
10 bulk
11 personal care
12 meat seafood
13 pantry
14 breakfast
15 canned goods
16 dairy eggs
17 household
18 babies
19 snacks
20 deli
21 missing

Table 3 - products

The products table shows each product’s unique id in the product_id column and its name in the product_name column; for example, product_id 1 represents Chocolate Sandwich Cookies. The table also shows the aisle_id and department_id associated with each product. There are approximately 50k ids in total and there are no missing values in the table.

The products table
product_id product_name aisle_id department_id
1 Chocolate Sandwich Cookies 61 19
2 All-Seasons Salt 104 13
3 Robust Golden Unsweetened Oolong Tea 94 7
4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce 38 1
5 Green Chile Anytime Sauce 5 13
6 Dry Nose Oil 11 11
7 Pure Coconut Water With Orange 98 7
8 Cut Russet Potatoes Steam N’ Mash 116 1
9 Light Strawberry Blueberry Yogurt 120 16
10 Sparkling Orange Juice & Prickly Pear Beverage 115 7
11 Peach Mango Juice 31 7
12 Chocolate Fudge Layer Cake 119 1
13 Saline Nasal Mist 11 11
14 Fresh Scent Dishwasher Cleaner 74 17
15 Overnight Diapers Size 6 56 18
16 Mint Chocolate Flavored Syrup 103 19
17 Rendered Duck Fat 35 12
18 Pizza for One Suprema Frozen Pizza 79 1
19 Gluten Free Quinoa Three Cheese & Mushroom Blend 63 9
20 Pomegranate Cranberry & Aloe Vera Enrich Drink 98 7
21 Small & Medium Dental Dog Treats 40 8
22 Fresh Breath Oral Rinse Mild Mint 20 11
23 Organic Turkey Burgers 49 12
24 Tri-Vi-Sol® Vitamins A-C-and D Supplement Drops for Infants 47 11
25 Salted Caramel Lean Protein & Fiber Bar 3 19
26 Fancy Feast Trout Feast Flaked Wet Cat Food 41 8
27 Complete Spring Water Foaming Antibacterial Hand Wash 127 11
28 Wheat Chex Cereal 121 14
29 Fresh Cut Golden Sweet No Salt Added Whole Kernel Corn 81 15
30 Three Cheese Ziti, Marinara with Meatballs 38 1
31 White Pearl Onions 123 4
32 Nacho Cheese White Bean Chips 107 19
33 Organic Spaghetti Style Pasta 131 9
34 Peanut Butter Cereal 121 14
35 Italian Herb Porcini Mushrooms Chicken Sausage 106 12
36 Traditional Lasagna with Meat Sauce Savory Italian Recipes 38 1
37 Noodle Soup Mix With Chicken Broth 69 15
38 Ultra Antibacterial Dish Liquid 100 21
39 Daily Tangerine Citrus Flavored Beverage 64 7
40 Beef Hot Links Beef Smoked Sausage With Chile Peppers 106 12
41 Organic Sourdough Einkorn Crackers Rosemary 78 19
42 Biotin 1000 mcg 47 11
43 Organic Clementines 123 4
44 Sparkling Raspberry Seltzer 115 7
45 European Cucumber 83 4
46 Raisin Cinnamon Bagels 5 count 58 1
47 Onion Flavor Organic Roasted Seaweed Snack 66 6
48 School Glue, Washable, No Run 87 17
49 Vegetarian Grain Meat Sausages Italian - 4 CT 14 20
50 Pumpkin Muffin Mix 105 13

Table 4 - order_products_train

This table shows the details of all orders in the training data set provided by Instacart, i.e. the product_ids purchased in each order. For example, order_id 1 consists of 8 products, including product ids 13176, 47209 and 22035. There are ~131k orders in total and there are no missing values in the table.

The order_products_train table
order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
36 19660 2 1
36 49235 3 0
36 43086 4 1
36 46620 5 1
36 34497 6 1
36 48679 7 1
36 46979 8 1
38 11913 1 0
38 18159 2 0
38 4461 3 0
38 21616 4 1
38 23622 5 0
38 32433 6 0
38 28842 7 0
38 42625 8 0
38 39693 9 0
96 20574 1 1
96 30391 2 0
96 40706 3 1
96 25610 4 0
96 27966 5 1
96 24489 6 1
96 39275 7 1
98 8859 1 1
98 19731 2 1
98 43654 3 1
98 13176 4 1
98 4357 5 1
98 37664 6 1
98 34065 7 1
98 35951 8 1
98 43560 9 1
98 9896 10 1
98 27509 11 1
98 15455 12 1
98 27966 13 1
98 47601 14 1
98 40396 15 1
98 35042 16 1
98 40986 17 1
98 1939 18 1

Table 5 - purchase time per order table

This table shows the purchase time (day of week in the order_dow column and hour of day in the order_hour_of_day column) for each order. For example, order_id 1187899 has order_dow 4 and order_hour_of_day 8, which means this order was made on Thursday (order_dow = 4) at 8am (order_hour_of_day = 8).

The purchase time per order table
order_id order_dow order_hour_of_day
1187899 4 8
1492625 1 11
2196797 0 11
525192 2 11
880375 1 14
1094988 6 10
1822501 0 19
1827621 0 21
2316178 2 19
2180313 3 10
2461523 6 9
1854765 1 12
3402036 1 12
965160 0 16
2614670 5 14
3110252 4 11
62370 2 13
698604 4 13
1524161 0 13
3173750 0 9
2032076 0 20
2803975 0 11
1864787 5 11
2436259 0 12
1947848 4 20
2906490 4 22
2924697 5 18
519514 4 12
1750084 3 9
1647290 4 16
3088145 2 10
39325 2 18
13318 1 9
1651215 0 12
1019719 2 12
2989905 6 8
2639013 0 13
1072954 6 17
34647 3 19
2757217 0 11
669729 5 12
3038639 5 13
2608424 2 14
482516 4 7
3294399 4 8
1700658 6 11
21708 0 6
2178718 2 8
1734166 5 18
859654 1 10

Table 6 - user_purchases

This table joins Tables 1-5 together, so it includes all the information that we need in the analysis, including order_id, purchase time, aisle and department. Please note that we do not include product_id and product_name in this table because their dimensionality is too large (over 50k categories). Thus, we will mainly use department_id (21 categories) in our analysis, and use aisle_id (134 categories) in the PCA part of the unsupervised learning section.

The user purchases table
order_id order_dow order_hour_of_day aisle_id aisle department_id department
1187899 4 8 77 soft drinks 7 beverages
1187899 4 8 21 packaged cheese 16 dairy eggs
1187899 4 8 120 yogurt 16 dairy eggs
1187899 4 8 54 paper goods 17 household
1187899 4 8 45 candy chocolate 19 snacks
1187899 4 8 117 nuts seeds dried fruit 19 snacks
1187899 4 8 121 cereal 14 breakfast
1187899 4 8 23 popcorn jerky 19 snacks
1187899 4 8 84 milk 16 dairy eggs
1187899 4 8 53 cream 16 dairy eggs
1187899 4 8 77 soft drinks 7 beverages
1492625 1 11 96 lunch meat 20 deli
1492625 1 11 58 frozen breads doughs 1 frozen
1492625 1 11 107 chips pretzels 19 snacks
1492625 1 11 23 popcorn jerky 19 snacks
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 91 soy lactosefree 16 dairy eggs
1492625 1 11 46 mint gum 19 snacks
1492625 1 11 96 lunch meat 20 deli
1492625 1 11 80 deodorants 11 personal care
1492625 1 11 1 prepared soups salads 20 deli
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 69 soup broth bouillon 15 canned goods
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 117 nuts seeds dried fruit 19 snacks
1492625 1 11 3 energy granola bars 19 snacks
1492625 1 11 69 soup broth bouillon 15 canned goods
1492625 1 11 69 soup broth bouillon 15 canned goods
2196797 0 11 29 honeys syrups nectars 13 pantry
2196797 0 11 24 fresh fruits 4 produce
2196797 0 11 21 packaged cheese 16 dairy eggs
2196797 0 11 66 asian foods 6 international
2196797 0 11 101 air fresheners candles 17 household
2196797 0 11 83 fresh vegetables 4 produce
2196797 0 11 66 asian foods 6 international
2196797 0 11 123 packaged vegetables fruits 4 produce

Visualization

Distribution of the order by time of purchase

The plots show the distribution of orders by purchase time (day of week and hour of day).


On the left chart (order_dow) we can observe that the most frequent ordering days are Sundays and Mondays compared to the rest of the week, and on the right chart (order_hour_of_day) we note a high demand for orders between 9am and 6pm.

Top 10 number of purchase by aisle

This table shows the top 10 aisles by number of purchases. We can see that the most purchased aisles are fresh vegetables and fresh fruits (~150k purchases each).

The top 10 number of purchase by aisle
aisle department total_order
fresh vegetables produce 150609
fresh fruits produce 150473
packaged vegetables fruits produce 78493
yogurt dairy eggs 55240
packaged cheese dairy eggs 41699
water seltzer sparkling water beverages 36617
milk dairy eggs 32644
chips pretzels snacks 31269
soy lactosefree dairy eggs 26240
bread bakery 23635


The number of purchase by department

This table shows the top 10 departments by number of purchases. We can see that the most purchased department is produce (~409k purchases).

The top 10 number of purchase by department
department total_order
produce 409087
dairy eggs 217051
snacks 118862
beverages 114046
frozen 100426
pantry 81242
bakery 48394
canned goods 46799
deli 44291
dry goods pasta 38713


Sales Patterns

Here, we would like to examine the sales patterns in more depth by splitting them by department. First, we look at the pattern of weekly sales.

From these graphs, we observe the following patterns:

  • Most departments, except Alcohol, share a similar pattern: purchases peak on Sunday and Monday, decrease during the weekdays, and start to increase again on Friday.
  • For Alcohol, purchases rise gradually from a trough on Monday to a peak on Friday, then drop sharply on Saturday.

PCA by department

We use PCA to analyze whether we can reduce the dimensionality of the data set (the number of orders by department).

PCA summarizes the shared variation between variables. It can be computed on either the correlation matrix (scaled) or the covariance matrix (non-scaled). In our analysis, we focus on the relationship between the number of orders from each department and the day of week on which users purchase, so we focus our PCA analysis on the non-scaled version, i.e. using the covariance matrix. However, it is also interesting to see how the results differ between the scaled and non-scaled versions, so we perform the correlation-based PCA as well.
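As a minimal sketch of the two variants (assuming a numeric table dept_counts with one row per order and one column per department; the object name is ours, not the exact code of the report), both versions can be obtained with prcomp:

# Non-scaled PCA: components are driven by the covariance matrix,
# so high-variance departments dominate.
pca_cov <- prcomp(dept_counts, center = TRUE, scale. = FALSE)

# Scaled PCA: every department is standardised first, i.e. the
# components are driven by the correlation matrix.
pca_cor <- prcomp(dept_counts, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
summary(pca_cov)$importance["Proportion of Variance", ]
summary(pca_cor)$importance["Proportion of Variance", ]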

Non-scaled PCA (Covariance)

We observe that the first and second components explain 46.7% and 13.8% of the variance of the data. Following the rule of thumb of selecting the number of dimensions that explains at least 75% of the variation, comp 1 - comp 5 are selected, which together explain around 79.8% of the variance of the data.

Our findings are as follows:

  • Produce has the highest variation. It is also highly positively correlated with Dim1 and negatively correlated with Dim2.
  • The departments with the second to sixth largest variances (Dairy eggs, Snacks, Frozen, Beverages and Pantry) are positively correlated with both Dim1 and Dim2.
Variance contribution from non-scaled PCA
eigenvalue percentage of variance cumulative percentage of variance
comp 1 12.642 46.69 46.7
comp 2 3.727 13.76 60.5
comp 3 2.130 7.87 68.3
comp 4 1.577 5.82 74.1
comp 5 1.535 5.67 79.8
comp 6 1.115 4.12 83.9
comp 7 0.647 2.39 86.3
comp 8 0.610 2.25 88.6
comp 9 0.515 1.90 90.5
comp 10 0.469 1.73 92.2

Scaled PCA (Correlation)

We find that the first and second components explain only 13.6% and 6.6% of the variance respectively, and we need 15 components (out of 21) to explain 75% of the variation. This means that the correlations between departments are very low and we cannot use PCA to reduce the dimensionality of the scaled data.

Variance contribution from scaled PCA
eigenvalue percentage of variance cumulative percentage of variance
comp 1 2.861 13.62 13.6
comp 2 1.382 6.58 20.2
comp 3 1.167 5.55 25.8
comp 4 1.049 5.00 30.8
comp 5 1.035 4.93 35.7
comp 6 1.008 4.80 40.5
comp 7 0.990 4.71 45.2
comp 8 0.972 4.63 49.8
comp 9 0.944 4.49 54.3
comp 10 0.931 4.43 58.8
comp 11 0.903 4.30 63.1
comp 12 0.874 4.16 67.2
comp 13 0.871 4.15 71.4
comp 14 0.839 4.00 75.4
comp 15 0.807 3.84 79.2

Supervised Learning

Data Preparation for Models

Before applying the models to the data, we aggregate the orders (order_id) by department, so that we know the number of products purchased in each department for each order. In addition, we keep the column order_dow to identify on which day of the week each order was placed.

After creating this new table, we convert the column order_dow from numeric (int) to categorical values (factor) and, to make these values easier to interpret, we map the integers to the names of the days of the week. For example, the value “0” is transformed to “Sunday”, “1” to “Monday”, “2” to “Tuesday”, and so on.

Moreover, we split our new table into two sets, to check that the model does not overfit the data and that the predictions generalize well. For the training set we randomly select 80% of the observations (around 105k), and the remaining observations form the test set (around 26k).
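The following sketch illustrates this preparation in R, assuming the user_purchases table of Table 6; the object names (orders_dept, train_set, test_set) and the seed are our own and only indicative of the steps described above:

library(dplyr)
library(tidyr)

# One row per order: number of products purchased in each department.
orders_dept <- user_purchases %>%
  count(order_id, order_dow, department) %>%
  pivot_wider(names_from = department, values_from = n, values_fill = 0)

# Use dotted column names, as in the table below (e.g. dairy.eggs).
names(orders_dept) <- make.names(names(orders_dept))

# Recode order_dow (0-6) into day names and store it as a factor.
orders_dept <- orders_dept %>%
  mutate(order_dow = factor(order_dow, levels = 0:6,
                            labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                       "Thursday", "Friday", "Saturday")))

# Random 80% / 20% train/test split.
set.seed(123)
train_idx <- sample(seq_len(nrow(orders_dept)),
                    size = floor(0.8 * nrow(orders_dept)))
train_set <- orders_dept[train_idx, ]
test_set  <- orders_dept[-train_idx, ]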

Number of products purchased by department per order
order_dow canned.goods dairy.eggs produce beverages deli frozen pantry snacks bakery household meat.seafood personal.care dry.goods.pasta babies missing other breakfast international alcohol bulk pets
Thursday 1 3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Saturday 0 3 3 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Saturday 0 0 6 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Saturday 0 0 4 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Wednesday 8 11 7 4 3 3 4 1 1 5 1 1 0 0 0 0 0 0 0 0 0
Friday 0 0 4 1 0 0 0 3 0 0 0 2 1 0 0 0 0 0 0 0 0
Sunday 0 2 9 0 0 0 1 3 1 0 0 0 1 0 0 0 0 0 0 0 0
Sunday 0 0 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Sunday 0 2 3 1 0 0 0 0 0 3 0 1 0 1 1 1 0 0 0 0 0
Wednesday 0 4 1 2 0 0 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0
Saturday 0 2 4 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
Sunday 0 0 2 0 0 1 3 0 0 1 2 0 0 0 0 0 0 0 0 0 0
Saturday 0 2 1 0 0 1 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0
Wednesday 0 4 7 0 1 0 2 0 0 0 0 0 0 0 1 0 0 2 0 0 0
Monday 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Saturday 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Sunday 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Saturday 0 0 5 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
Tuesday 0 4 19 0 0 2 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0
Thursday 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
Monday 1 1 0 5 1 1 0 3 0 2 0 0 0 0 0 0 0 0 0 0 0
Tuesday 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
Thursday 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
Monday 0 2 7 10 1 1 1 0 0 0 0 0 2 0 0 0 0 1 0 0 0
Monday 1 2 2 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0
Saturday 0 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Wednesday 0 0 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Monday 0 1 4 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
Sunday 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
Friday 0 0 3 0 2 0 3 0 1 0 0 0 0 0 0 0 0 0 0 0 0
Sunday 0 0 9 3 0 0 0 0 1 0 0 2 2 0 0 0 0 0 0 0 0
Friday 1 5 3 0 0 7 0 4 3 0 1 0 3 0 0 0 1 2 0 0 0
Sunday 0 0 2 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0
Thursday 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Friday 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Wednesday 0 1 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Monday 0 3 6 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
Wednesday 0 0 5 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
Thursday 0 0 0 5 0 0 1 3 0 1 0 0 0 0 0 1 0 0 0 0 0
Sunday 0 1 14 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Tuesday 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Monday 0 2 7 3 1 1 2 0 1 0 0 1 0 0 0 0 1 0 0 0 0
Friday 0 0 0 3 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
Tuesday 0 4 6 3 2 3 1 7 1 0 1 0 0 0 1 0 2 0 0 0 0
Monday 0 4 3 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Friday 3 1 7 1 0 0 6 0 0 0 0 1 1 0 0 0 0 0 0 0 0
Monday 0 0 3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
Tuesday 2 0 4 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Tuesday 0 2 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
Friday 0 3 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0

Models

Our goal is to determine on which day of the week a given order will be placed. Since we have transformed the column order_dow into a factor with categorical values, we apply models suited to a classification task.

We have chosen the models as follows:

  1. Decision Trees
  2. Random Forest
  3. Multinomial Logistic Regression
  4. Logistic Regression

In addition, we will apply some of the following approaches to each of the models:

  • One day of the week - Unbalanced data
  • One day of the week - Balanced data with Sub-sampling and Cross-Validation
  • Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation

Decision Trees - Classification

Decision trees are algorithms that recursively search the space for the best possible boundary until a stopping rule prevents them from splitting further (Ivo Bernardo, 2021). The basic idea is to split the data space into rectangles, evaluating each split; the main goal is to minimize the impurity of each split relative to the previous one.

One day of the week - Unbalanced data

For this approach we want to measure the accuracy of the model on the unbalanced data. Furthermore, it will be interesting to see which departments are chosen to split the data into days of the week, so that they can later be compared with the balanced data and cross-validation (second approach).

According to the pruned tree, we observe that the produce department has the most relevance among the departments, which could be influenced by the fact that this department has the highest number of products purchased in our data set. Furthermore, the tree shows that when the number of produce products is greater than or equal to 3, the model classifies the day of the week as Sunday; when it is lower than 3, the tree splits into another node based on the frozen department.

Likewise, the same procedure applies to this node and the following ones: each starts from the previous node and tries to minimize the impurity at each split. It should be noted that not all days of the week appear in the terminal nodes, because of the way the trees are generated. For the same reason, we expect the predictions on the test set to be zero for the days of the week other than Sunday and Monday.
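A minimal sketch of this unbalanced fit, assuming the hypothetical train_set and test_set objects introduced in the data-preparation sketch above:

library(rpart)
library(caret)

# Classification tree on the unbalanced training data (order_id excluded).
tree_fit <- rpart(order_dow ~ . - order_id, data = train_set, method = "class")

# Predict the day of week on the test set and build the confusion matrix.
tree_pred <- predict(tree_fit, newdata = test_set, type = "class")
confusionMatrix(tree_pred, test_set$order_dow)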

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday         0      0        0      0        0       0         0
#>   Monday       694    706      578    650      708     685       661
#>   Saturday       0      0        0      0        0       0         0
#>   Sunday      2787   3228     3202   4843     2483    2538      2476
#>   Thursday       0      0        0      0        0       0         0
#>   Tuesday        0      0        0      0        0       0         0
#>   Wednesday      0      0        0      0        0       0         0
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.211         
#>                  95% CI : (0.207, 0.216)
#>     No Information Rate : 0.209         
#>     P-Value [Acc > NIR] : 0.2           
#>                                         
#>                   Kappa : 0.016         
#>                                         
#>  Mcnemar's Test P-Value : NA            
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                  0.000        0.1795           0.000
#> Specificity                  1.000        0.8217           1.000
#> Pos Pred Value                 NaN        0.1508             NaN
#> Neg Pred Value               0.867        0.8503           0.856
#> Prevalence                   0.133        0.1499           0.144
#> Detection Rate               0.000        0.0269           0.000
#> Detection Prevalence         0.000        0.1784           0.000
#> Balanced Accuracy            0.500        0.5006           0.500
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                  0.882           0.000          0.000
#> Specificity                  0.194           1.000          1.000
#> Pos Pred Value               0.225             NaN            NaN
#> Neg Pred Value               0.861           0.878          0.877
#> Prevalence                   0.209           0.122          0.123
#> Detection Rate               0.185           0.000          0.000
#> Detection Prevalence         0.822           0.000          0.000
#> Balanced Accuracy            0.538           0.500          0.500
#>                      Class: Wednesday
#> Sensitivity                      0.00
#> Specificity                      1.00
#> Pos Pred Value                    NaN
#> Neg Pred Value                   0.88
#> Prevalence                       0.12
#> Detection Rate                   0.00
#> Detection Prevalence             0.00
#> Balanced Accuracy                0.50

As expected, only Sunday and Monday are ever predicted, while the remaining days are never predicted. Overall, the accuracy of this model is low, with a score of 0.211; compared with a random seven-class classifier, the model is only (0.211 - 1/7) ≈ 0.07, i.e. about 7 percentage points above chance at classifying the days of the week. It is also important to note that there is a big difference between sensitivity and specificity because our data are not balanced.

One day of the week - Balanced data with Sub-sampling and Cross-Validation

For this approach, we balance the data with sub-sampling and make the overall score more robust by applying cross-validation, which also helps us to find the best set of hyperparameters.
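One way to implement this balancing and tuning with caret, again assuming the hypothetical train_set from above, is to down-sample inside each cross-validation fold:

library(caret)

set.seed(123)
# 5-fold cross-validation with down-sampling inside each fold to balance classes.
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")

tree_cv <- train(order_dow ~ . - order_id, data = train_set,
                 method = "rpart", trControl = ctrl)
tree_cv$bestTune   # complexity parameter selected by cross-validation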

#>  .outcome  Fri Mon Sat Sun Thu Tue Wed                                    cover
#>  Saturday [.14 .13 .16 .14 .14 .14 .15] when produce <  3 & frozen >= 1     18%
#>    Sunday [.13 .15 .15 .18 .13 .13 .13] when produce >= 3                   44%
#>  Thursday [.15 .14 .13 .11 .16 .15 .16] when produce <  3 & frozen <  1     38%

The left column (.outcome) of the rules shows the day selected for each terminal node (the one with the highest probability), followed by the probability of each day of the week under that rule. For the last rule, Wednesday and Thursday appear to have the same probability because of rounding, but Thursday is about 0.003 above Wednesday, as can be seen from the tree plot.

The rightmost column (cover) gives the percentage of observations covered by each rule. For example, the first rule says that Saturday is chosen when produce is lower than 3 and frozen is greater than or equal to 1, which covers 18% of the observations. We can then look at the results of the model in the confusion matrix.

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday         0      0        0      0        0       0         0
#>   Monday         0      0        0      0        0       0         0
#>   Saturday     678    645      751    904      595     587       583
#>   Sunday      1454   1797     1768   2992     1235    1291      1259
#>   Thursday    1349   1492     1261   1597     1361    1345      1295
#>   Tuesday        0      0        0      0        0       0         0
#>   Wednesday      0      0        0      0        0       0         0
#> 
#> Overall Statistics
#>                                        
#>                Accuracy : 0.195        
#>                  95% CI : (0.19, 0.199)
#>     No Information Rate : 0.209        
#>     P-Value [Acc > NIR] : 1            
#>                                        
#>                   Kappa : 0.035        
#>                                        
#>  Mcnemar's Test P-Value : NA           
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                  0.000          0.00          0.1987
#> Specificity                  1.000          1.00          0.8223
#> Pos Pred Value                 NaN           NaN          0.1583
#> Neg Pred Value               0.867          0.85          0.8591
#> Prevalence                   0.133          0.15          0.1441
#> Detection Rate               0.000          0.00          0.0286
#> Detection Prevalence         0.000          0.00          0.1808
#> Balanced Accuracy            0.500          0.50          0.5105
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                  0.545          0.4265          0.000
#> Specificity                  0.576          0.6382          1.000
#> Pos Pred Value               0.254          0.1403            NaN
#> Neg Pred Value               0.827          0.8894          0.877
#> Prevalence                   0.209          0.1216          0.123
#> Detection Rate               0.114          0.0519          0.000
#> Detection Prevalence         0.450          0.3697          0.000
#> Balanced Accuracy            0.560          0.5324          0.500
#>                      Class: Wednesday
#> Sensitivity                      0.00
#> Specificity                      1.00
#> Pos Pred Value                    NaN
#> Neg Pred Value                   0.88
#> Prevalence                       0.12
#> Detection Rate                   0.00
#> Detection Prevalence             0.00
#> Balanced Accuracy                0.50

From the confusion matrix we observe a better balance between sensitivity and specificity across the classes. Comparing the previous model with this one, for the class Sunday the sensitivity changed from 0.882 to 0.545 and the specificity from 0.194 to 0.576. As expected, the accuracy decreased from 0.211 to 0.195, meaning the model is only (0.195 - 1/7) ≈ 0.05, about 5 percentage points above chance per day of the week, but the balanced accuracy is better. This model would be preferred over the one trained on unbalanced data.

Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation

For the final approach, we collapse the levels of the column order_dow into two: one for the weekdays and one for the weekend. On top of that, we balance the levels “weekday” and “weekend” and use cross-validation to train the model.
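A minimal sketch of this recoding, assuming (as elsewhere in these sketches) that the weekend is defined as Saturday plus Sunday:

# Collapse the seven day levels into a two-level factor.
to_daytype <- function(d) {
  factor(ifelse(d %in% c("Saturday", "Sunday"), "weekend", "weekday"))
}
train_set$day_type <- to_daytype(train_set$order_dow)
test_set$day_type  <- to_daytype(test_set$order_dow)
table(train_set$day_type)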

We tried to plot the final tree computed by the model, but it was not possible to interpret it due to overlapping nodes in the graph; we could nevertheless see that the departments produce, frozen and meat.seafood were among the first splits.

The confusion matrix shows a good balance between sensitivity and specificity. The accuracy of the model is similar to the other two approaches: the model is (0.533 - 1/2) = 0.033, about 3 percentage points above chance. Overall, all the approaches score low at predicting the day of the week from the department composition of previous orders.

Random Forest

Random Forests (RF) are ensembles of decision trees whose final prediction aggregates the outcomes of the individual trees (the user can set the number of trees and the number of variables considered at each node). One of the reasons we test this method is that RF is considered more stable than a single decision tree: more trees generally give better performance, but these advantages come at a price, since RF slows down computation and cannot be visualized as easily. We will nevertheless look at the results for later comparison (Saikumar Talari, 2022).

Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation

For this method we use the same approach as the last one for the classification tree. We faced some computation speed problems while running the model, so we decided to use only 10,000 orders to reduce the waiting time.
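A sketch of this step under the same assumptions as before (hypothetical train_set / test_set with the day_type column added above); the 10,000-order subset mirrors the reduction described in this paragraph:

library(caret)

set.seed(123)
# Work on a random subset of 10,000 orders to keep the forest tractable.
small_idx <- sample(seq_len(nrow(train_set)), 10000)

ctrl  <- trainControl(method = "cv", number = 5, sampling = "down")
rf_cv <- train(day_type ~ . - order_id - order_dow,
               data = train_set[small_idx, ],
               method = "rf", trControl = ctrl, ntree = 200)

confusionMatrix(predict(rf_cv, test_set), test_set$day_type)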

As expected, the accuracy of the model is higher than that of the classification tree, and Cohen’s Kappa is slightly higher as well, with a balanced accuracy that is essentially unchanged. This model would be preferred for predicting “weekday” versus “weekend” as it gives better results.

Multinomial logistic regression

One day of the week - Unbalanced data

Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes (Wikipedia, 2021). Like binary logistic regression, multinomial logistic regression uses maximum likelihood estimation to evaluate the probability of categorical membership.

Our first approach is to predict the day of the week on which the order will be placed according to the product composition of the order. Since there are 7 days in a week, this is not a binary but a multinomial logistic regression problem.

We select Sunday as the reference level. To build the model, we use the number of products in each department of the order as explanatory variables.
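A minimal sketch of this model using nnet::multinom, with the same hypothetical train_set / test_set as above (the department counts are the predictors; the weekday/weekend indicator added earlier is excluded):

library(nnet)
library(caret)

# Make sure Sunday is the reference level on both sets.
train_set$order_dow <- relevel(train_set$order_dow, ref = "Sunday")
test_set$order_dow  <- relevel(test_set$order_dow,  ref = "Sunday")

# Multinomial logit: department counts as explanatory variables.
multi_fit <- multinom(order_dow ~ . - order_id - day_type,
                      data = train_set, trace = FALSE)

multi_pred <- predict(multi_fit, newdata = test_set)   # class predictions
confusionMatrix(multi_pred, test_set$order_dow)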

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday       129     99       91     79       93      94       102
#>   Monday        69     87       39     54       95      89        85
#>   Saturday      39     46       50     51       36      29        24
#>   Sunday      3208   3673     3584   5278     2931    2981      2900
#>   Thursday      20     19        9     21       24      18        18
#>   Tuesday        0      0        0      0        0       0         0
#>   Wednesday     16     10        7     10       12      12         8
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.213         
#>                  95% CI : (0.208, 0.218)
#>     No Information Rate : 0.209         
#>     P-Value [Acc > NIR] : 0.105         
#>                                         
#>                   Kappa : 0.01          
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                0.03706       0.02211         0.01323
#> Specificity                0.97548       0.98068         0.98998
#> Pos Pred Value             0.18777       0.16795         0.18182
#> Neg Pred Value             0.86882       0.85043         0.85634
#> Prevalence                 0.13267       0.14993         0.14406
#> Detection Rate             0.00492       0.00332         0.00191
#> Detection Prevalence       0.02618       0.01974         0.01048
#> Balanced Accuracy          0.50627       0.50140         0.50160
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                 0.9609        0.007521          0.000
#> Specificity                 0.0708        0.995444          1.000
#> Pos Pred Value              0.2149        0.186047            NaN
#> Neg Pred Value              0.8723        0.878705          0.877
#> Prevalence                  0.2093        0.121613          0.123
#> Detection Rate              0.2012        0.000915          0.000
#> Detection Prevalence        0.9358        0.004916          0.000
#> Balanced Accuracy           0.5158        0.501483          0.500
#>                      Class: Wednesday
#> Sensitivity                  0.002550
#> Specificity                  0.997100
#> Pos Pred Value               0.106667
#> Neg Pred Value               0.880408
#> Prevalence                   0.119555
#> Detection Rate               0.000305
#> Detection Prevalence         0.002858
#> Balanced Accuracy            0.499825

According to the confusion matrix, the accuracy (0.213) is low and there is a big difference between sensitivity and specificity in each class. For example, the sensitivity of class Friday is 0.037 while its specificity is 0.975. Also, the kappa (0.01) is very small, which means the observed accuracy is only a little higher than what one would expect from a random model. We therefore try to balance the data and use cross-validation to improve the model accuracy.

One day of the week - balanced data with cross-validation

Before balancing the data, we need to check the frequency of each class. The class Wednesday has the smallest frequency (12,550), so we balance the data by sub-sampling every class down to the frequency of class Wednesday.

#> 
#>    Sunday    Friday    Monday  Saturday  Thursday   Tuesday 
#>     21972     13925     15738     15121     12768     12896 
#> Wednesday 
#>     12550
#> 
#>    Sunday    Friday    Monday  Saturday  Thursday   Tuesday 
#>     12550     12550     12550     12550     12550     12550 
#> Wednesday 
#>     12550

We only sub-sample the data, without applying cross-validation. Now every class has the same frequency (12,550).
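One way to perform this sub-sampling is caret's downSample, which shrinks every class to the size of the smallest one; the column names below are the hypothetical ones used in the earlier sketches:

library(caret)

set.seed(123)
dept_cols <- setdiff(names(train_set), c("order_id", "order_dow", "day_type"))

# Down-sample every class to the frequency of the smallest class (Wednesday).
balanced_train <- downSample(x = train_set[, dept_cols],
                             y = train_set$order_dow, yname = "order_dow")
table(balanced_train$order_dow)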

We tried cross-validation on the sub-sampled data using the train function of the caret package, but the data set is too big and takes a very long time to run, so we decided not to include cross-validation.

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday       245    258      222    297      203     203       200
#>   Monday       348    448      383    629      310     347       329
#>   Saturday     403    436      459    684      348     345       336
#>   Sunday       797   1033     1064   1868      664     735       698
#>   Thursday     737    760      715    869      718     709       689
#>   Tuesday      356    371      360    453      320     282       290
#>   Wednesday    595    628      577    693      628     602       595
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.176         
#>                  95% CI : (0.171, 0.181)
#>     No Information Rate : 0.209         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.03          
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                0.07038        0.1139          0.1214
#> Specificity                0.93923        0.8948          0.8864
#> Pos Pred Value             0.15049        0.1603          0.1524
#> Neg Pred Value             0.86851        0.8513          0.8570
#> Prevalence                 0.13267        0.1499          0.1441
#> Detection Rate             0.00934        0.0171          0.0175
#> Detection Prevalence       0.06205        0.1065          0.1148
#> Balanced Accuracy          0.50481        0.5044          0.5039
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                 0.3401          0.2250         0.0875
#> Specificity                 0.7594          0.8057         0.9066
#> Pos Pred Value              0.2723          0.1382         0.1160
#> Neg Pred Value              0.8130          0.8825         0.8765
#> Prevalence                  0.2093          0.1216         0.1228
#> Detection Rate              0.0712          0.0274         0.0107
#> Detection Prevalence        0.2614          0.1981         0.0927
#> Balanced Accuracy           0.5497          0.5153         0.4970
#>                      Class: Wednesday
#> Sensitivity                    0.1897
#> Specificity                    0.8388
#> Pos Pred Value                 0.1378
#> Neg Pred Value                 0.8840
#> Prevalence                     0.1196
#> Detection Rate                 0.0227
#> Detection Prevalence           0.1646
#> Balanced Accuracy              0.5143

From the confusion matrix we notice an improvement in the balance between sensitivity and specificity for each class. For example, the sensitivity and specificity of class Thursday were 0.008 and 0.995 in the previous model; after balancing the data, they are 0.225 and 0.806. The kappa is also higher (from 0.01 to 0.03).

Logistic regression

Weekdays and Weekend - Balanced data and Cross-Validation

Logistic regression is a regression method adapted to binary classification. The basic idea is to reuse the mechanism developed for linear regression by modeling the probability p_i with a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model but the same for all observations. The linear combination is then transformed into a probability by a sigmoid function.

In order to further improve model quality, we consider aggregating the day-of-week classes. Buying behavior usually differs between weekdays and weekends, so we separate the days of the week into two classes: weekday and weekend.

Now that the outcome variable has only two categories, we can use binomial logistic regression.
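A sketch of this binomial fit with balanced folds and cross-validation, under the same hypothetical objects as before:

library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")

# Binomial logistic regression on the weekday/weekend outcome.
glm_cv <- train(day_type ~ . - order_id - order_dow, data = train_set,
                method = "glm", family = binomial, trControl = ctrl)

confusionMatrix(predict(glm_cv, test_set), test_set$day_type)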

According to the confusion matrix, the balanced accuracy is higher and the difference between sensitivity (0.628) and specificity (0.479) is smaller. The Kappa is now 0.10, higher than the Kappa of the previous model (0.03), and the accuracy is 0.57.

Comparing this model with the previous ones, we note that the random forest model is the one whose results come closest; logistic regression is slightly higher in accuracy by 0.008, in Cohen’s Kappa by 0.010, and in balanced accuracy by 0.005. For those reasons, we choose this model over the rest.

Variable Importance

Variable importance is a method that provides a measure of the importance of each feature for the prediction quality of the model. We analyze the variable importance of our four models. There are 21 explanatory variables in each model, and we only show the top 10 most important variables in the plots.

Note: we faced some computation speed problems while running the variable importance for all the models, so we decided to use the same number of observations as for the random forest model (10k) to reduce the waiting time.

As we note from this chart, for the classification tree, random forest and logistic regression models we use the AUC loss to compare model quality after shuffling each variable. AUC is a synthetic measure of the distance to the random model in the ROC curve plot; the larger the AUC, the better the model.

According to the feature importance of these three models, the most important department is produce. As we saw above in the table of the number of purchases per department, produce is the leading department, with almost twice the number of purchases of the second one. This could be one of the main reasons why all three models chose it as the most relevant: if we shuffle this variable, the AUC of the model suffers the largest loss.

For the multinomial logistic regression model, we use the Root Mean Square Error (RMSE) to compare model quality after shuffling each variable. According to the plot, the most important variable is dairy eggs: if we shuffle this variable, the RMSE of the model increases the most.
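The report does not name the package used for these permutation-importance plots; one common way to obtain them with an AUC-based loss is the DALEX package, sketched here under that assumption for the weekday/weekend logistic model (glm_cv from the earlier sketch):

library(DALEX)

# Explainer for the logistic model; y must be numeric 0/1 for the AUC loss.
expl <- explain(glm_cv,
                data = test_set[, setdiff(names(test_set),
                                          c("order_id", "order_dow", "day_type"))],
                y = as.numeric(test_set$day_type == "weekend"),
                label = "logistic regression")

# Permutation importance: loss in (1 - AUC) when each variable is shuffled.
vi <- model_parts(expl, loss_function = loss_one_minus_auc)
plot(vi, max_vars = 10)

For the multinomial model, the same call with the default loss_root_mean_square would produce an RMSE-based ranking instead.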

Unsupervised learning

In this section, we work with unsupervised learning methods, i.e. clustering and PCA, to learn whether we can reduce the dimensionality of the original data set and how to group the data by similarity or dissimilarity across all features. We then go further by using a hybrid supervised/unsupervised learning method that performs the prediction (supervised) on the result of the PCA analysis (unsupervised).

Clustering

In this section, we will study clustering approaches, Hierarchical clustering and Partitioning methods, to find groups of instances/observations that have similar features.

Due to the limitations of the clustering functions in R, the execution time when we tried to cluster all 131,206 instances was very long. Since in this exercise we would like to focus on the approaches and methodologies, we randomly choose only 1% of the instances (1,312) to perform the analysis and reduce the execution time.

Hierarchical clustering

Distance

First, we apply agglomerative nesting (AGNES) and compute the distances with the Euclidean metric, because all our features are numerical.
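A sketch of this step, assuming the hypothetical orders_dept table from the earlier sections and a 1% random sample of its rows:

library(cluster)

set.seed(123)
sample_idx  <- sample(seq_len(nrow(orders_dept)),
                      size = round(0.01 * nrow(orders_dept)))
cluster_dat <- orders_dept[sample_idx,
                           setdiff(names(orders_dept), c("order_id", "order_dow"))]

# Euclidean distances between instances, then agglomerative nesting (AGNES)
# with complete linkage.
d         <- dist(cluster_dat, method = "euclidean")
agnes_fit <- agnes(d, method = "complete")
plot(as.dendrogram(as.hclust(agnes_fit)))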

Example of the Euclidean distance between instances
Var1 Var2 value
1 1 0.00
2 1 9.95
3 1 6.00
4 1 9.16
5 1 7.14
6 1 10.95
7 1 7.62
8 1 8.72
9 1 9.90
10 1 6.00
11 1 4.47
12 1 11.53
13 1 7.75
14 1 7.00
15 1 11.49
16 1 11.66
17 1 2.45
18 1 9.27
19 1 11.87
20 1 7.07
21 1 9.80
22 1 11.96
23 1 9.70
24 1 10.58
25 1 9.33
26 1 6.00
27 1 8.00
28 1 9.22
29 1 9.75
30 1 10.77
31 1 11.40
32 1 5.75
33 1 10.34
34 1 5.83
35 1 9.54
36 1 10.91
37 1 9.59
38 1 9.95
39 1 8.94
40 1 6.33
41 1 11.09
42 1 8.89
43 1 11.87
44 1 9.11
45 1 12.49
46 1 11.79
47 1 11.00
48 1 9.64
49 1 15.46
50 1 7.68


Dendrogram

We then draw the dendrogram using complete linkage to visualize the output of the hierarchical clustering. Since the full dendrogram is difficult to read, we will select the optimal number of clusters and cut the tree branches in the next steps.

Choice of the number of clusters

We choose the optimal number of clusters using three statistics: the within-cluster sum of squares, the gap statistic and the silhouette, computed with complete linkage on Euclidean distances.

From the graph we can interpret the following:

  • The within-cluster sum of squares: there seems to be an elbow at 3 clusters, so we choose the optimal k = 3.
  • The silhouette: the larger the average silhouette width, the better; here the optimal number of clusters is 3.
  • The gap statistic: it returns 2 clusters as the optimal number.

For those reasons we choose 3 clusters and cut the tree as follows.
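A sketch of these diagnostics and of the final cut, using factoextra on the hypothetical cluster_dat sample from above (hc_method = "complete" is passed on to hcut to match the linkage used here):

library(factoextra)

# Diagnostics for the number of clusters based on hierarchical cuts.
fviz_nbclust(cluster_dat, hcut, method = "wss", hc_method = "complete")
fviz_nbclust(cluster_dat, hcut, method = "silhouette", hc_method = "complete")
fviz_nbclust(cluster_dat, hcut, method = "gap_stat", nboot = 50,
             hc_method = "complete")

# Cut the dendrogram into the chosen 3 clusters.
clusters_hc <- cutree(as.hclust(agnes_fit), k = 3)
table(clusters_hc)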

Interpretation of the clusters

We will analyze the clusters by using the box plot for each feature.

Our observations are as follows:

  • Cluster 2 has very small canned goods, deli, meat.seafood and dry.goods.pasta, while Cluster 1 and Cluster 3 have relatively high amounts for these departments.
  • Cluster 3 has higher snacks, beverages, dairy eggs and breakfast than Cluster 1.
  • Cluster 1 has the highest produce.

Partitioning methods

In this section, we will apply partitioning methods, K-means and Partitioning Around the Medoid (PAM). For the partitioning methods, we first need to identify the number of clusters and then use the chosen number of clusters to perform the analysis.

K-means

We will use WSS, silhouette and the Gap statistic to determine the number of clusters used for K-means. It’s important to note that K-means is suitable for numerical features only. Since all our features are numerical, it’s appropriate to perform the K-means analysis.

  • The within-cluster sum of squares: there seems to be an elbow at 2 clusters, so we choose the optimal k = 2.
  • The silhouette: with this approach, k = 2 (the highest average silhouette width) is the optimal number of clusters.
  • The gap statistic: this graph is not conclusive. The function chose 13 as the optimal number, but 15 and 22, which are local maxima, could also be considered.

Therefore, 2 is the optimal number of clusters. Afterward, we use box plots to distinguish the characteristics of the two clusters. We observe that cluster 2 has a higher average number of purchases than cluster 1 in every department whose median number of purchases is above zero, such as canned.goods, dairy.eggs, produce, beverages, deli, frozen, pantry, snacks, bakery, meat.seafood and dry.goods.pasta.

Next, we show a scatter plot along the first and second principal components, grouped by the 2 K-means clusters. We see that PC1 distinguishes the clusters quite well: cluster 1 has higher PC1 values than cluster 2.
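A minimal sketch of the K-means fit and of this projection, again on the hypothetical cluster_dat sample:

library(factoextra)

set.seed(123)
km_fit <- kmeans(cluster_dat, centers = 2, nstart = 25)

# Scatter plot along the first two principal components, coloured by cluster.
fviz_cluster(km_fit, data = cluster_dat, geom = "point")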

Regarding the principal components, the percentages of variance of PC1 and PC2 are 47% and 14% respectively, which is close to what we observed in the PCA analysis of the EDA section. However, the numbers are not identical because, due to computing limitations, we used only 1% of the instances from the original data set for this clustering exercise.

Partitioning Around the Medoid (PAM)

Similar to K-means, we need to find an optimal number of clusters before performing the analysis. We will use the silhouette to determine the optimal k.

From the graph below, we find that the optimal number of clusters is 2.

Then, we will plot silhouette to show the silhouettes of all the instances and the average silhouette.
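A sketch of the PAM fit and of the silhouette plot, on the same hypothetical sample:

library(cluster)
library(factoextra)

pam_fit <- pam(cluster_dat, k = 2)

# Silhouette of every instance plus per-cluster and overall averages.
fviz_silhouette(pam_fit)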

From the graph, we see that cluster 2 is well formed (well separated from Cluster 1, with the average silhouette of 0.46). Cluster 1 is less homogeneous with an average silhouette of 0.16 only. The average silhouette of the data set is 0.4.

Afterward, we use box plots to distinguish the characteristics of the two clusters. We observe that cluster 1 has a higher average number of purchases than cluster 2. It is interesting that the characteristics of the two PAM clusters are very similar to those of the K-means clusters.

PCA and hybrid supervised and unsupervised learning approach

In our data set, there are 134 aisles, which are grouped into 21 departments. So far, in our supervised learning approach, we have focused on the number of purchases per department. In this section, we would like to combine supervised and unsupervised learning approaches as follows.

  1. Grouping aisles by using PCA: our assumption is that grouping aisles with PCA might better reflect customers’ purchase patterns than grouping by department.

  2. Performing the supervised learning approach on the output of the first step.

PCA by aisle

Non-scaled PCA (Covariance)

We observe that the first and second components explain 23.98% and 9.02% of the variance of the data. Following the rule of thumb of selecting the number of dimensions that explains at least 75% of the variation, comp 1 - comp 28 are selected, which together explain around 75.5% of the variance of the data.

Variance contribution from non-scaled PCA
eigenvalue percentage of variance cumulative percentage of variance
comp 1 4.211 23.985 24.0
comp 2 1.585 9.027 33.0
comp 3 0.889 5.061 38.1
comp 4 0.668 3.806 41.9
comp 5 0.531 3.022 44.9
comp 6 0.487 2.772 47.7
comp 7 0.442 2.516 50.2
comp 8 0.370 2.106 52.3
comp 9 0.350 1.993 54.3
comp 10 0.303 1.724 56.0
comp 11 0.293 1.671 57.7
comp 12 0.276 1.573 59.3
comp 13 0.266 1.513 60.8
comp 14 0.241 1.372 62.1
comp 15 0.217 1.233 63.4
comp 16 0.206 1.172 64.5
comp 17 0.194 1.102 65.6
comp 18 0.183 1.040 66.7
comp 19 0.181 1.029 67.7
comp 20 0.175 0.998 68.7
comp 21 0.170 0.967 69.7
comp 22 0.163 0.931 70.6
comp 23 0.152 0.867 71.5
comp 24 0.149 0.849 72.3
comp 25 0.144 0.818 73.1
comp 26 0.141 0.806 74.0
comp 27 0.139 0.790 74.7
comp 28 0.130 0.742 75.5


Scaled PCA (Correlation)

We find that the first and second components can explain only 3.2% and 1.8% respectively, and we need 93 components (out of 134) to explain 75% of the variation. This means that correlations between aisles are very low and we cannot use PCA to reduce the dimensions of the scaled data.

Variance contribution from scaled PCA
eigenvalue percentage of variance cumulative percentage of variance
comp 1 4.27 3.186 3.19
comp 2 2.43 1.814 5.00
comp 3 1.93 1.439 6.44
comp 4 1.68 1.257 7.70
comp 5 1.56 1.168 8.86
comp 6 1.46 1.088 9.95
comp 7 1.42 1.059 11.01
comp 8 1.33 0.990 12.00
comp 9 1.27 0.948 12.95
comp 10 1.25 0.937 13.88

All in all, we can see that scaled PCA cannot reduce the dimensionality of the data set. The disadvantage of non-scaled PCA is that it favours variables with high variance and high correlation with the others, and tends to neglect variables with low variance and low correlation; nevertheless, it delivers the main benefit of PCA, which is dimensionality reduction. Moreover, since our research objective is to support demand forecasting tasks such as inventory arrangement, the high-variation departments and aisles tend to be more important for this purpose. Thus, we focus on non-scaled PCA.

A hybrid supervised and unsupervised learning approach

In this section, we apply a supervised learning approach to the output of the PCA (non-scaled) analysis. From the PCA analysis, we select the first 28 principal components, which explain over 75% of the total variation. For the supervised learning step, we use the logistic regression model, which was the best model from the supervised learning section.
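A sketch of this hybrid pipeline, assuming an order-by-aisle count table aisle_counts whose rows are aligned with orders_dept (both names are ours):

# Non-scaled PCA on the aisle counts; keep the first 28 components.
pca_aisle <- prcomp(aisle_counts, center = TRUE, scale. = FALSE)

day_type <- factor(ifelse(orders_dept$order_dow %in% c("Saturday", "Sunday"),
                          "weekend", "weekday"))
pc_data  <- data.frame(pca_aisle$x[, 1:28], day_type = day_type)

library(caret)
set.seed(123)
ctrl       <- trainControl(method = "cv", number = 5, sampling = "down")
hybrid_fit <- train(day_type ~ ., data = pc_data, method = "glm",
                    family = binomial, trControl = ctrl)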

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction weekday weekend
#>    weekday   10655    4978
#>    weekend    6313    4295
#>                                         
#>                Accuracy : 0.57          
#>                  95% CI : (0.564, 0.576)
#>     No Information Rate : 0.647         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.088         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.628         
#>             Specificity : 0.463         
#>          Pos Pred Value : 0.682         
#>          Neg Pred Value : 0.405         
#>              Prevalence : 0.647         
#>          Detection Rate : 0.406         
#>    Detection Prevalence : 0.596         
#>       Balanced Accuracy : 0.546         
#>                                         
#>        'Positive' Class : weekday       
#> 

According to the result from the original data set (without PCA) in the supervised learning section, sensitivity, specificity, kappa and accuracy are 0.628, 0.479, 0.103 and 0.57 respectively.

From the confusion matrix of the hybrid approach, we find that the sensitivity (0.628) and the accuracy (0.57) are equivalent to the result from logistic regression. However, the specificity (0.463) and kappa (0.088) are slightly lower than the result of logistic regression. Thus, we conclude that this method doesn’t improve the quality of the model.

Conclusion

For the supervised learning models, we use sensitivity, specificity, balanced accuracy, overall accuracy and kappa to compare the quality of the different models.

For the unsupervised learning model, we perform clustering and PCA analysis to group a set of instances/features that share some common characteristics. In addition, we also use the result from the PCA analysis to perform a hybrid supervised/unsupervised learning model to see if we can use PCA to improve the performance of the supervised learning model.

Supervised learning

One day of the week - Unbalanced data

According to Tables 1 and 2, the overall accuracy and kappa of the decision tree and the multinomial logistic regression are very low. Moreover, there is a serious imbalance between sensitivity and specificity. The quality of the two models is very similar, and both need improvement.

Accuracy: decision tree: 0.211; multinomial logistic regression: 0.213

Kappa: decision tree: 0.016; multinomial logistic regression: 0.01

Table 1 - Decision Tree

decision tree
Accuracy Kappa
overall 0.211 0.016
Sensitivity Specificity Balanced Accuracy
Class: Friday 0.000 1.000 0.500
Class: Monday 0.179 0.822 0.501
Class: Saturday 0.000 1.000 0.500
Class: Sunday 0.882 0.194 0.538
Class: Thursday 0.000 1.000 0.500
Class: Tuesday 0.000 1.000 0.500
Class: Wednesday 0.000 1.000 0.500

Table 2 - Multinomial Logistic Regression

multinomial logistic regression
Accuracy Kappa
overall 0.213 0.01
Sensitivity Specificity Balanced Accuracy
Class: Friday 0.037 0.975 0.506
Class: Monday 0.022 0.981 0.501
Class: Saturday 0.013 0.990 0.502
Class: Sunday 0.961 0.071 0.516
Class: Thursday 0.008 0.995 0.501
Class: Tuesday 0.000 1.000 0.500
Class: Wednesday 0.003 0.997 0.500

One day of the week - balanced data

According to Tables 3 and 4, after we balance the data and apply cross-validation, the gap between sensitivity and specificity is smaller. Even though the accuracy is slightly lower, the kappa values of both the decision tree and the multinomial logistic regression are higher.

Accuracy: decision tree: 0.195; multinomial logistic regression: 0.176

Kappa: decision tree: 0.035; multinomial logistic regression: 0.03
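
As a sketch of how the balancing and cross-validation can be set up (caret's trainControl supports down-sampling within each fold; train_df, test_df and the day_of_week outcome are hypothetical names):

library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")   # 5-fold CV with down-sampled folds

tree_cv     <- train(day_of_week ~ ., data = train_df,
                     method = "rpart", trControl = ctrl)             # decision tree
multinom_cv <- train(day_of_week ~ ., data = train_df,
                     method = "multinom", trControl = ctrl,
                     trace = FALSE)                                  # multinomial logistic regression

confusionMatrix(predict(tree_cv, test_df), test_df$day_of_week)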

Table 3 - with cross-validation: Decision Tree

with cross-validation: decision tree
Accuracy Kappa
overall 0.195 0.035
Sensitivity Specificity Balanced Accuracy
Class: Friday 0.000 1.000 0.500
Class: Monday 0.000 1.000 0.500
Class: Saturday 0.199 0.822 0.510
Class: Sunday 0.545 0.576 0.560
Class: Thursday 0.427 0.638 0.532
Class: Tuesday 0.000 1.000 0.500
Class: Wednesday 0.000 1.000 0.500

Table 4 - Multinomial Logistic Regression

multinomial logistic regression
Accuracy Kappa
overall 0.176 0.03
Sensitivity Specificity Balanced Accuracy
Class: Friday 0.070 0.939 0.505
Class: Monday 0.114 0.895 0.504
Class: Saturday 0.121 0.886 0.504
Class: Sunday 0.340 0.759 0.550
Class: Thursday 0.225 0.806 0.515
Class: Tuesday 0.087 0.907 0.497
Class: Wednesday 0.190 0.839 0.514

Weekdays and weekends - Balanced data with cross-validation

According to Table 5, the balanced accuracy and kappa are higher after we aggregate the days of the week into weekdays and weekends (the aggregation itself is sketched after Table 5). We finally choose the logistic regression model because it has the highest balanced accuracy and kappa.

Table 5 - The scores of three models

The scores of three models (Decision Tree, Random Forest, Logistic Regression)
Sensitivity Specificity Accuracy Balanced_accuracy Kappa
Decision Tree 0.488 0.614 0.533 0.551 0.091
Random Forest 0.609 0.489 0.567 0.549 0.093
Logistic Regression 0.628 0.479 0.575 0.554 0.103
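
The weekday/weekend label is a one-line aggregation. A sketch assuming a day_of_week factor of day names (the mapping from Instacart's numeric order_dow codes to day names is our own assumption, since the data dictionary does not document it):

# label each order as weekend or weekday from its day name
orders$day_type <- factor(
  ifelse(orders$day_of_week %in% c("Saturday", "Sunday"), "weekend", "weekday")
)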

Unsupervised learning

Clustering

Clustering is a method that divides data, labeled or unlabeled, into groups of similar instances. We used two clustering approaches, hierarchical clustering and partitioning, and chose the number of clusters based on cluster-validity statistics. We found that the chosen number of clusters depends heavily on the statistic used: for example, for the partitioning method, the optimal number of clusters from the gap statistic is 13, while the optimal number from the silhouette is only 2.

For hierarchical clustering, we chose 3 as the optimal number of clusters. The main characteristics of the clusters are, for example, high “produce” purchases in cluster 1 and high “snacks” and “beverages” purchases in cluster 3. The average numbers of purchases in clusters 1 and 3 are significantly higher than in cluster 2 for every department.

For the partitioning methods, we chose 2 as the optimal number of clusters for both the k-means and PAM models. For k-means, we found that cluster 2 has higher average numbers of purchases than cluster 1 in every department, so cluster 2 in k-means has characteristics similar to clusters 1 and 3 from the hierarchical clustering. For PAM, the results are very close to those from k-means.
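
A sketch of the three clustering approaches and the cluster-number diagnostics, assuming a hypothetical order-by-department count matrix dept_counts (factoextra's fviz_nbclust() computes the gap and silhouette statistics discussed above; the Ward linkage is our assumption):

library(cluster)      # pam()
library(factoextra)   # fviz_nbclust()

# cluster-number diagnostics: these two criteria disagree (13 vs. 2) on our data
fviz_nbclust(dept_counts, kmeans, method = "gap_stat")
fviz_nbclust(dept_counts, kmeans, method = "silhouette")

# hierarchical clustering, cut into 3 clusters (linkage is an assumption)
hc     <- hclust(dist(dept_counts), method = "ward.D2")
hc_grp <- cutree(hc, k = 3)

# partitioning: k-means and PAM, each with 2 clusters
km <- kmeans(dept_counts, centers = 2, nstart = 25)
pm <- pam(dept_counts, k = 2)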

PCA

To derive the benefits of PCA, and to test our hypothesis that grouping by aisle reflects customer purchasing behavior better than grouping by department, we perform PCA at the aisle level. We focus on non-scaled PCA to keep the setting in line with the supervised learning analysis. As a result, components 1 to 28 (out of 134) explain 75.5% of the total variation in the data; fresh vegetables, fresh fruits and packaged vegetables fruits have the highest variance and are the top three contributors to components 1 and 2. We also perform scaled PCA, which needs 93 of the 134 components to explain at least 75% of the total variation, so we cannot derive the benefits of PCA there; it also shows that the aisles are only weakly correlated. Finally, we feed the first 28 components into the best model from the supervised learning approach, logistic regression, as a hybrid supervised/unsupervised method. However, we find that this hybrid method does not improve the accuracy of the prediction.
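
For reference, the variance shares and the top contributors to the first two components can be inspected directly. A short sketch reusing the hypothetical aisle_counts matrix:

library(factoextra)

pca <- prcomp(aisle_counts, scale. = FALSE)
summary(pca)                                              # cumulative proportion reaches ~75.5% at PC28
fviz_contrib(pca, choice = "var", axes = 1:2, top = 10)   # produce aisles dominate PCs 1-2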

Limitation

The data set provided by Instacart contains only a few variables: the products purchased per order, product descriptions, the day of the week of each order, and the hour of each order. We found that this information is not sufficient to create predictive models accurate enough for commercial use. However, Amazon is an ideal case study showing that, with a high-quality and comprehensive data set, it is even possible to use machine learning to anticipate and ship products before customers actually order them.

We also ran into limitations when running machine learning functions in R on large numbers of observations. In our study, we focus only on the training data set, which contains over 100k observations, and do not include the prior data set, which represents all historical orders before the training set and contains over a million observations. This is because the execution time is extremely long and errors occur regularly when we run the models on all historical data (prior + training).

Further study

Machine learning and artificial intelligence have played a key role in the boom of the e-commerce industry, and there are many applications for machine learning techniques in this field: purchase and repurchase prediction for anticipating what customers will order next, and recommendation systems for suggesting products that users might like to purchase, which are central to the success of Amazon and Netflix. There are also other applications such as fraud prediction and marketing campaigns. However, to build these models with sufficient accuracy for commercial use, we need more data, such as customers' personal information, browsing history and click history, to better understand customer behavior.

References

Ivo Bernardo, "Classification Decision Trees, Easily Explained," Aug 30, 2021.

Saikumar Talari, "Random Forest® vs Decision Tree: Key Differences," Feb 18, 2022.

Wikipedia, "Multinomial logistic regression," Aug 23, 2021.

S. Walusala W., R. Rimiru, and C. Otieno, "A hybrid machine learning approach for credit scoring using PCA and logistic regression," International Journal of Computer, ISSN 2307-4523.

Jeremy Stanley, "3 Million Instacart Orders, Open Sourced," May 3, 2017.

Nick Shaw, "Ecommerce Machine Learning: AI's Role in the Future of Online Shopping," Bigcommerce Blog.

The Economic Times, "Amazon may predict and ship your order before you place it," Jan 27, 2014.